
    The LKPY Package for Recommender Systems Experiments

    Since 2010, we have built and maintained LensKit, an open-source toolkit for building, researching, and learning about recommender systems. We have successfully used the software in a wide range of recommender systems experiments, to support education in traditional classroom and online settings, and as the algorithmic backend for user-facing recommendation services in movies and books. This experience, along with community feedback, has surfaced a number of challenges with LensKit’s design and environmental choices. In response to these challenges, we are developing a new set of tools that leverage the PyData stack to enable the kinds of research experiments and educational experiences that we have been able to deliver with LensKit, along with new experimental structures that the existing code makes difficult. The result is a set of research tools that should significantly increase research velocity and provide much smoother integration with other software such as Keras while maintaining the same level of reproducibility as a LensKit experiment. In this paper, we reflect on the LensKit project, particularly on our experience using it for offline evaluation experiments, and describe the next-generation LKPY tools for enabling new offline evaluations and experiments with flexible, open-ended designs and well-tested evaluation primitives.
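
    To give a concrete sense of the style of experiment these tools target, the sketch below runs a small offline evaluation in the spirit of the early lkpy API. It is a minimal sketch under stated assumptions, not code from the paper: names such as partition_users, batch.recommend, and topn.RecListAnalysis follow that API and may differ between lenskit versions, and a ratings table with user, item, and rating columns is assumed.

```python
# Minimal sketch of an lkpy-style offline evaluation (API names follow
# early lkpy releases and may differ between versions).
import pandas as pd
from lenskit import batch, topn
from lenskit import crossfold as xf
from lenskit.algorithms import Recommender, item_knn as knn

ratings = pd.read_csv('ratings.csv')  # assumed columns: user, item, rating

all_recs, all_test = [], []
# 5 partitions, holding out 20% of each user's ratings as test data
for train, test in xf.partition_users(ratings, 5, xf.SampleFrac(0.2)):
    algo = Recommender.adapt(knn.ItemItem(20))  # item-item k-NN, 20 neighbors
    algo.fit(train)
    users = test['user'].unique()
    all_recs.append(batch.recommend(algo, users, 10))  # top-10 list per user
    all_test.append(test)

recs = pd.concat(all_recs, ignore_index=True)
truth = pd.concat(all_test, ignore_index=True)

# Score the recommendation lists with nDCG, averaged over users
rla = topn.RecListAnalysis()
rla.add_metric(topn.ndcg)
results = rla.compute(recs, truth)
print(results['ndcg'].mean())
```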

    Candidate Set Sampling for Evaluating Top-N Recommendation

    The strategy for selecting candidate sets -- the set of items that the recommendation system is expected to rank for each user -- is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user's test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data, which is typically unavailable in ordinary experiments.
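
    As a concrete illustration of the candidate set construction described above, the sketch below builds, for a single user, the union of their held-out test items with a fixed number of decoys sampled from items the user has not interacted with. The function and variable names are ours, chosen for illustration, and uniform sampling is only one of the selection strategies such an experiment might use.

```python
# Illustrative candidate set construction for one user in a top-N evaluation:
# the candidate set is the user's test items plus randomly sampled decoys.
# Names (build_candidates, n_decoys, ...) are illustrative, not from the paper.
import numpy as np

def build_candidates(test_items, train_items, catalog, n_decoys, seed=None):
    """Return the union of the user's test items and n_decoys sampled decoys."""
    rng = np.random.default_rng(seed)
    test_items = list(set(test_items))
    seen = set(test_items) | set(train_items)
    # decoys are drawn from catalog items the user has not interacted with
    pool = np.array([item for item in catalog if item not in seen])
    decoys = rng.choice(pool, size=min(n_decoys, len(pool)), replace=False)
    return test_items + list(decoys)

# Rank 100 decoys alongside the user's 5 held-out test items
candidates = build_candidates(test_items=[3, 17, 42, 99, 250],
                              train_items=[1, 2, 5, 8],
                              catalog=range(1000), n_decoys=100, seed=42)
```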

    Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse

    This paper calls attention to the missing component of the recommender system evaluation process: statistical inference. There is active research in several components of the recommender system evaluation process: selecting baselines, standardizing benchmarks, and target item sampling. However, there has not yet been significant work on the role and use of statistical inference for analyzing recommender system evaluation results. In this paper, we argue that the use of statistical inference is a key component of the evaluation process that has not been given sufficient attention. We support this argument with a systematic review of recent RecSys papers to understand how statistical inference is currently being used, along with a brief survey of studies that have been done on the use of statistical inference in the information retrieval community. We present several challenges that exist for inference in recommendation experiments, which buttress the need for empirical studies to aid with appropriately selecting and applying statistical inference techniques.
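
    As one concrete example of the kind of analysis at issue, per-user metric values from two systems can be compared with a paired significance test. The sketch below applies SciPy's paired t-test and Wilcoxon signed-rank test to simulated per-user nDCG scores; it illustrates the idea only and is not a recommendation of these particular tests.

```python
# Illustrative statistical inference over offline evaluation results:
# compare two recommenders via paired tests on per-user nDCG values.
# The scores here are simulated placeholders, not real experiment output.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ndcg_a = rng.beta(2, 5, size=500)                                   # system A, per-user nDCG
ndcg_b = np.clip(ndcg_a + rng.normal(0.01, 0.05, size=500), 0, 1)   # system B, slightly better

t_stat, p_t = stats.ttest_rel(ndcg_b, ndcg_a)   # paired t-test on user-level scores
w_stat, p_w = stats.wilcoxon(ndcg_b, ndcg_a)    # nonparametric alternative
print(f"paired t-test: t={t_stat:.2f}, p={p_t:.4f}")
print(f"Wilcoxon signed-rank: W={w_stat:.1f}, p={p_w:.4f}")
```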

    Measuring Fairness in Ranked Results: An Analytical and Empirical Comparison

    Information access systems, such as search and recommender systems, often use ranked lists to present results believed to be relevant to the user's information need. Evaluating these lists for their fairness along with other traditional metrics provides a more complete understanding of an information access system's behavior beyond accuracy or utility constructs. To measure the (un)fairness of rankings, particularly with respect to the protected group(s) of producers or providers, several metrics have been proposed in the last several years. However, an empirical and comparative analysis of these metrics, showing their applicability to specific scenarios or real data as well as their conceptual similarities and differences, is still lacking. We aim to bridge the gap between theoretical and practical application of these metrics. In this paper we describe several fair ranking metrics from the existing literature in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare them on the same experimental setup and data sets in the context of three information access tasks. We also provide a sensitivity analysis to assess the impact of the design choices and parameter settings that go into these metrics and point to additional work needed to improve fairness measurement.
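
    To give a flavor of what such metrics compute, the sketch below implements a simple provider-side exposure statistic: each rank position receives a logarithmic discount, and the share of discounted exposure going to protected-group items is compared against a target share. This is a generic member of the exposure-based family, written for illustration; it is not one of the specific metrics compared in the paper.

```python
# Illustrative provider-side exposure statistic for a single ranked list.
# Exposure at rank k is discounted by 1/log2(k + 1); the metric reports the
# fraction of total exposure received by protected-group items.
# This is a generic example, not one of the paper's specific metrics.
import numpy as np

def protected_exposure_share(ranking, protected_items):
    """Fraction of discounted exposure that goes to protected-group items."""
    ranks = np.arange(1, len(ranking) + 1)
    discounts = 1.0 / np.log2(ranks + 1)   # position-based exposure weights
    is_protected = np.array([item in protected_items for item in ranking], dtype=float)
    return float((discounts * is_protected).sum() / discounts.sum())

ranking = ['a', 'b', 'c', 'd', 'e', 'f']
protected_items = {'b', 'e', 'f'}
share = protected_exposure_share(ranking, protected_items)
print(f"protected exposure share: {share:.3f} (compare to a target, e.g. catalog share 0.5)")
```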

    Monte Carlo Estimates of Evaluation Metric Error and Bias: Work in Progress

    Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well-explored in the existing literature. In this paper, we survey a range of work on identifying and addressing these problems, and report on our work in progress to simulate the recommender data generation and evaluation processes to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
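
    A toy version of the simulation idea, under assumptions invented purely for illustration, looks like the sketch below: true relevant items are drawn for each simulated user, only a popularity-biased subset of them is "observed" as test data, and recall of a fixed popularity recommender is computed against both, so the gap estimates the metric error introduced by the missing data.

```python
# Toy Monte Carlo sketch of evaluation metric error under popularity-biased
# observation. All distributions and parameters are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_relevant, k, n_trials = 1000, 30, 10, 2000

popularity = 1.0 / np.arange(1, n_items + 1)   # Zipf-like popularity, item 0 most popular
popularity /= popularity.sum()
recs = np.arange(k)                            # recommender that returns the k most popular items

errors = []
for _ in range(n_trials):
    relevant = rng.choice(n_items, size=n_relevant, replace=False)   # true relevant items
    obs_prob = np.clip(popularity[relevant] * 100, 0.05, 1.0)        # popular items observed more often
    observed = relevant[rng.random(n_relevant) < obs_prob]           # biased, incomplete test set
    true_recall = np.isin(recs, relevant).sum() / n_relevant
    obs_recall = np.isin(recs, observed).sum() / observed.size if observed.size else np.nan
    errors.append(obs_recall - true_recall)

print(f"mean recall@{k} error (observed - true): {np.nanmean(errors):+.4f}")
```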

    Behaviorism is Not Enough: Better Recommendations Through Listening to Users

    Behaviorism is the currently dominant paradigm for building and evaluating recommender systems. Both the operation and the evaluation of recommender system applications are most often driven by analyzing the behavior of users. In this paper, we argue that listening to what users say — about the items and recommendations they like, the control they wish to exert on the output, and the ways in which they perceive the system — and not just observing what they do will enable important developments in the future of recommender systems. We provide both philosophical and pragmatic motivations for this idea, describe the various points in the recommendation and evaluation processes where explicit user input may be considered, and discuss benefits that may result from considered incorporation of user preferences at each of these points. In particular, we envision recommender applications that aim to support users’ better selves: helping them live the life that they desire to lead. For example, recommender-assisted behavior change requires algorithms to predict not what users choose or do now, inferable from behavioral data, but what they should choose or do in the future to become healthier, fitter, more sustainable, or culturally aware. We hope that our work will spur useful discussion and many new ideas for recommenders that empower their users.

    Recommender Systems Notation

    As the field of recommender systems has developed, authors have used a myriad of notations for describing the mathematical workings of recommendation algorithms. These notations appear in research papers, books, lecture notes, blog posts, and software documentation. The disciplinary diversity of the field has not contributed to consistency in notation; scholars whose home base is in information retrieval have different habits and expectations than those in machine learning or human-computer interaction. In the course of years of teaching and research on recommender systems, we have seen the value in adopting a consistent notation across our work. This has been particularly highlighted in our development of the Recommender Systems MOOC on Coursera (Konstan et al. 2015), as we need to explain a wide variety of algorithms and our learners are not well-served by changing notation between algorithms. In this paper, we describe the notation we have adopted in our work, along with its justification and some discussion of considered alternatives. We present this in hope that it will be useful to others writing and teaching about recommender systems. This notation has served us well for some time now, in research, online education, and traditional classroom instruction. We feel it is ready for broad use.
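
    As a small illustration of what a shared notation looks like in practice, a fragment in this general style might read as follows; the symbols shown are conventional choices used here for illustration, not necessarily the exact conventions the paper adopts.

```latex
% Illustrative fragment of a consistent recommender systems notation.
% Symbol choices are conventional examples, not necessarily the paper's.
\begin{align*}
  u \in U          &\quad \text{a user, drawn from the set of users} \\
  i \in I          &\quad \text{an item, drawn from the set of items} \\
  r_{ui}           &\quad \text{the rating user } u \text{ has given item } i \\
  I_u \subseteq I  &\quad \text{the set of items user } u \text{ has rated} \\
  s(i; u)          &\quad \text{the score the system assigns item } i \text{ for user } u
\end{align*}
```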

    2nd FATREC Workshop: Responsible Recommendation

    The second Workshop on Responsible Recommendation (FATREC 2018) was held in conjunction with the 12th ACM Conference on Recommender Systems on October 6th, 2018 in Vancouver, Canada. This full-day workshop brought together researchers and practitioners to discuss several topics under the banner of social responsibility in recommender systems: fairness, accountability, transparency, privacy, and other ethical and social concerns.